PREDICTION IN FUNCTIONAL LINEAR REGRESSION

By T. Tony Cai and Peter Hall

University of Pennsylvania and Australian National University 

There has been substantial recent work on methods for estimat- 
ing the slope function in linear regression for functional data analysis. 
However, as in the case of more conventional finite-dimensional re- 
gression, much of the practical interest in the slope centers on its 
application for the purpose of prediction, rather than on its signifi- 
cance in its own right. We show that the problems of slope-function 
estimation, and of prediction from an estimator of the slope function, 
have very different characteristics. While the former is intrinsically 
nonparametric, the latter can be either nonparametric or semiparametric.
In particular, the optimal mean-square convergence rate of
predictors is $n^{-1}$, where $n$ denotes sample size, if the predictand is
a sufficiently smooth function. In other cases, convergence occurs at
a polynomial rate that is strictly slower than $n^{-1}$. At the boundary
between these two regimes, the mean-square convergence rate is less
than $n^{-1}$ by only a logarithmic factor. More generally, the rate of
convergence of the predicted value of the mean response in the re- 
gression model, given a particular value of the explanatory variable, 
is determined by a subtle interaction among the smoothness of the 
predictand, of the slope function in the model, and of the autocovari- 
ance function for the distribution of explanatory variables. 

1. Introduction. In the problem of functional linear regression we observe
data $\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$, where the $X_i$'s are independent and
identically distributed as a random function $X$, defined on an interval $\mathcal{I}$,
and the $Y_i$'s are generated by the regression model,

(1.1) $Y_i = a + \int_{\mathcal{I}} b X_i + \varepsilon_i$.

Here, $a$ is a constant, denoting the intercept in the model, and $b$ is a
square-integrable function on $\mathcal{I}$, representing the slope function. The majority of
attention usually focuses on estimating b, typically by methods based on 
functional principal components. See, for example, [28], Chapter 10, and [29]. 

In functional linear regression, perhaps as distinct from more conventional 
linear regression, there is significant interest in $b$ in its own right. In particular,
since $b$ is a function rather than a scalar, knowing where $b$ takes
large or small values provides information about where a future observation
$x$ of $X$ will have greatest leverage on the value of $\int_{\mathcal{I}} bx$. Such information can
be very useful for understanding the role played by the functional explanatory
variable. Nevertheless, as this example suggests, the greatest overall
interest lies, as in conventional linear regression, in using an estimator $\hat b$ as
an aid to predicting, either qualitatively or quantitatively, a future value of
$\int_{\mathcal{I}} bx$.

Thus, while there is extensive literature on properties of $\hat b$, for example on
convergence rates of $\hat b$ to $b$ (see, e.g., [11, 13, 15, 20]), there is arguably a still
greater need to understand the manner in which $\hat b$ should be constructed in
order to optimize the prediction of $\int_{\mathcal{I}} bx$, or of $a + \int_{\mathcal{I}} bx$. This is the problem
addressed in the present paper.

Estimation of $b$ is intrinsically an infinite-dimensional problem. Therefore,
unlike slope estimation in conventional finite-dimensional regression, it
involves smoothing or regularization. The smoothing step is used to reduce
dimension, and the extent to which this should be done depends on the use
to which the estimator of $b$ will be put, as well as on the smoothness of $b$. It is
in this way that the problem of estimating $\int_{\mathcal{I}} bx$ is quite different from that
of estimating $b$. The operation of integration, in computing $\int_{\mathcal{I}} \hat b x$ from $\hat b$,
confers additional smoothness, with the result that if we smooth $\hat b$ optimally
for estimating $b$ then it will usually be oversmoothed for estimating $\int_{\mathcal{I}} bx$.

Therefore the construction of $\hat b$, as a prelude to estimating $\int_{\mathcal{I}} bx$, should
involve significant undersmoothing relative to the amount of smoothing that
would be used if we wished only to estimate $b$ itself. In fact, as we shall
show, the degree of undersmoothing can be so great that it enables $\int_{\mathcal{I}} bx$ to
be estimated root-$n$ consistently, even though $b$ itself could not be estimated
at such a fast rate.

However, root-$n$ consistency is not always possible when estimating $\int_{\mathcal{I}} bx$.
The optimal convergence rate depends on a delicate balance among the
smoothness of $b$, the smoothness of $x$, and the smoothness of the autocovariance
of the stochastic process $X$, all measured with respect to the same
sequence of basis functions. In a qualitative sense, $\int_{\mathcal{I}} bx$ can be estimated
root-$n$ consistently if and only if $x$ is sufficiently smooth relative to the
degree of smoothness of the autocovariance. If $x$ is less smooth than this,
then the optimal rate at which $\int_{\mathcal{I}} bx$ can be estimated is determined jointly
by the smoothnesses of b, x and the autocovariance, and becomes faster as 
the smoothnesses of x and of b increase, and also as the smoothness of the 
covariance decreases. 

These results are made explicit in Section 4, which gives upper
bounds to rates of convergence for specific estimators of $\int_{\mathcal{I}} bx$, and lower
bounds (of the same order as the upper bounds) to rates of convergence for
general estimators. Section 2 describes construction of the specific estimators
of $b$, which are then substituted for $b$ in the formula $\int_{\mathcal{I}} bx$. Practical choice
of smoothing parameters is discussed in Section 3.

In this brief account of the problem we have omitted mention of the role
of the intercept, $a$, in the prediction problem. It turns out that from a
theoretical viewpoint the role is minor. Given an estimator $\hat b$ of $b$, we can
readily estimate $a$ by $\hat a = \bar Y - \int_{\mathcal{I}} \hat b \bar X$, where $\bar X$ and $\bar Y$ denote the means of
the samples of $X_i$'s and $Y_i$'s, respectively. Taking this approach, it emerges
that the rate of convergence of our estimator of $a + \int_{\mathcal{I}} bx$ is identical to that
of our estimator of $\int_{\mathcal{I}} bx$, up to terms that converge to zero at the parametric
rate $n^{-1/2}$. This point will be discussed in greater detail in Section 4.1.

The approach taken in this paper to estimating b is based on functional 
principal components. While other methods could be used, the PC technique 
is currently the most popular. It goes back to work of Besse and Ramsay 
[1], Ramsay and Dalzell [27], Rice and Silverman [31] and Silverman [32, 
33]. There are a great many more recent contributions, including those of 
Brumback and Rice [5], Cardot [7], Cardot, Ferraty and Sarda [8, 9, 10], 
Girard [19], James, Hastie and Sugar [23], Boente and Fraiman [3] and He,
Müller and Wang [21].

Other recent work on regression for functional data includes that of Ferré
and Yao [18], who introduced a functional version of sliced inverse regres- 
sion; Preda and Saporta [26], who discussed linear regression on clusters of 
functional data; Escabias, Aguilera and Valderrama [14] and Ratcliffe, Heller 
and Leader [30], who described applications of functional logistic regression; 
and Ferraty and Vieu [16, 17] and Masry [24], who addressed various aspects 
of nonparametric regression for functional data. Müller and Stadtmüller [25]
introduced the generalized functional linear model, where the response $Y_i$
is a general smooth function of $a + \int_{\mathcal{I}} bX_i$, plus an error. See also [22] and
[12]. The methods developed in the present paper could be extended to this 
setting. 

2. Model and estimators. We shall assume model (1.1), and suppose that
the errors $\varepsilon_i$ are independent and identically distributed with zero mean and
finite variance. It will be assumed too that the errors are independent of the
$X_i$'s and that $\int_{\mathcal{I}} E(X^2) < \infty$.

Conventionally, estimation of b is undertaken using a principal compo- 
nents approach, as follows. We take the covariance function of X to be 
positive definite, in which case it admits a spectral decomposition in terms
of strictly positive eigenvalues $\theta_j$,

(2.1) $K(u,v) = \mathrm{cov}\{X(u),X(v)\} = \sum_{j=1}^{\infty} \theta_j \phi_j(u)\phi_j(v)$, $u,v \in \mathcal{I}$,

where $(\theta_j,\phi_j)$ are (eigenvalue, eigenfunction) pairs for the linear operator
with kernel $K$, the eigenvalues are ordered so that $\theta_1 > \theta_2 > \cdots$ (in particular,
we assume there are no ties among the eigenvalues), and the functions
$\phi_1,\phi_2,\ldots$ form an orthonormal basis for the space of all square-integrable
functions on $\mathcal{I}$.

Empirical versions of $K$ and of its spectral decomposition are

$\hat K(u,v) = \frac{1}{n} \sum_{i=1}^{n} \{X_i(u) - \bar X(u)\}\{X_i(v) - \bar X(v)\} = \sum_{j=1}^{\infty} \hat\theta_j \hat\phi_j(u)\hat\phi_j(v)$, $u,v \in \mathcal{I}$,

where $\bar X = n^{-1}\sum_i X_i$. Analogously to the case of $K$, $(\hat\theta_j,\hat\phi_j)$ are (eigenvalue,
eigenfunction) pairs for the linear operator with kernel $\hat K$, ordered such that
$\hat\theta_1 \ge \hat\theta_2 \ge \cdots$. Moreover, $\hat\theta_j = 0$ for $j \ge n+1$. We take $(\hat\theta_j,\hat\phi_j)$ to be our
estimator of $(\theta_j,\phi_j)$. The function $b$ can be expressed in terms of its Fourier
series, as $b = \sum_{j\ge 1} b_j\phi_j$, where $b_j = \int_{\mathcal{I}} b\phi_j$. We estimate $b$ as

(2.2) $\hat b = \sum_{j=1}^{m} \hat b_j \hat\phi_j$,

where $m$, lying in the range $1 \le m \le n$, denotes a "frequency cut-off" and
$\hat b_j$ is an estimator of $b_j$.

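To construct $\hat b_j$ we note that $b_j = \theta_j^{-1} g_j$, where $g_j$ denotes the $j$th Fourier
coefficient of $g(u) = \int_{\mathcal{I}} K(u,v)b(v)\,dv$. A consistent estimator of $g$ is given
by

$\hat g(u) = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar Y)\{X_i(u) - \bar X(u)\}$,

and so, for $1 \le j \le m$, we take $\hat b_j = \hat\theta_j^{-1} \hat g_j$, where $\hat g_j = \int_{\mathcal{I}} \hat g \hat\phi_j$.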
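
As a rough illustration of the construction above, the following Python sketch computes $\hat b$ for curves recorded on a uniform grid. The function name `fpca_slope`, the use of NumPy and the Riemann-sum approximation of integrals over $\mathcal{I}$ are assumptions made for this example only; they are not part of the definition of the estimator.

```python
import numpy as np

def fpca_slope(X, Y, t, m):
    """Sketch of the estimator (2.2) for curves sampled on a uniform grid.

    X: (n, p) array, row i holds X_i evaluated at the grid points t.
    Y: (n,) responses.  m: frequency cut-off, 1 <= m <= n.
    Integrals are approximated by Riemann sums (an assumption of this sketch).
    """
    n, p = X.shape
    dt = t[1] - t[0]                             # uniform grid spacing
    Xc = X - X.mean(axis=0)                      # X_i - X_bar
    K_hat = Xc.T @ Xc / n                        # empirical covariance on the grid
    # Eigenpairs of the integral operator with kernel K_hat (hence the factor dt).
    evals, evecs = np.linalg.eigh(K_hat * dt)
    order = np.argsort(evals)[::-1][:m]          # indices of the m largest eigenvalues
    theta_hat = evals[order]
    phi_hat = evecs[:, order].T / np.sqrt(dt)    # rows have unit L2 norm on the grid
    g_hat = (Y - Y.mean()) @ Xc / n              # g_hat(u) evaluated on the grid
    g_hat_j = phi_hat @ g_hat * dt               # Fourier coefficients of g_hat
    b_hat_j = g_hat_j / theta_hat                # b_hat_j = g_hat_j / theta_hat_j
    b_hat = b_hat_j @ phi_hat                    # b_hat(t) = sum_j b_hat_j phi_hat_j(t)
    return b_hat, b_hat_j, theta_hat, phi_hat
```

In this discretized form $\hat b$ is the projection of the slope onto the leading $m$ empirical eigenvectors when the data are noise-free, which is a convenient sanity check.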

While the problem of estimating $b$ is of intrinsic interest, it is arguably
not of as much practical importance as that of prediction, that is, estimating

$p(x) = E(Y \mid X = x) = a + \int_{\mathcal{I}} bx$

for a particular function $x$. To accomplish this task we require an estimator
of $a$,

$\hat a = \bar Y - \int_{\mathcal{I}} \hat b \bar X = a - \int_{\mathcal{I}} (\hat b - b)\bar X + \bar\varepsilon$.

Here, $\bar Y$ and $\bar\varepsilon$ are the respective means of the sequences $Y_i$ and $\varepsilon_i$. Our
estimator of $p(x)$, for a given function $x$, is

$\hat p(x) = \hat a + \int_{\mathcal{I}} \hat b x$.

In Section 4 we shall introduce three parameters, $\alpha$, $\beta$ and $\gamma$, describing
the smoothness of $K$, $b$ and $x$, respectively. In each case, smoothness is measured
in the context of generalized Fourier expansions in the basis $\phi_1,\phi_2,\ldots$,
and the larger the value of the parameter, the smoother the associated
function. We shall show in Theorem 4.1 that if $x$ is sufficiently smooth relative
to $K$, specifically if $\gamma > \frac{1}{2}(\alpha+1)$, then $\int_{\mathcal{I}} bx$ can be estimated root-$n$
consistently. For smaller values of $\gamma$, the optimal convergence rate is slower
than $n^{-1/2}$.
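
To make the plug-in predictor of this section concrete, here is a minimal sketch under the same grid assumptions as before. Given values of an estimate $\hat b$ on a uniform grid, it forms $\hat a = \bar Y - \int \hat b \bar X$ and returns $\hat a + \int \hat b x$. The name `predict_mean` and the Riemann-sum integration are illustrative assumptions.

```python
import numpy as np

def predict_mean(b_hat, X, Y, x, t):
    """Plug-in estimate of p(x) = a + int b x on a uniform grid t.

    b_hat, x: (p,) arrays of function values on the grid.
    X: (n, p) array of observed curves.  Y: (n,) responses.
    """
    dt = t[1] - t[0]
    # Intercept estimate: a_hat = Y_bar - int b_hat X_bar
    a_hat = Y.mean() - np.sum(b_hat * X.mean(axis=0)) * dt
    # Predicted mean response at the new curve x
    return a_hat + np.sum(b_hat * x) * dt
```

With noise-free responses and $\hat b = b$, the function returns $a + \int bx$ exactly in the discretized sense, since $\bar Y = a + \int b \bar X$.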

3. Numerical implementation and simulation study. There is a variety
of possible approaches to empirical choice of the cut-off, $m$, although not all
are directly suited to estimation of $\int_{\mathcal{I}} bx$. Potential methods include those
based on simple least-squares, on the bootstrap or on cross-validation. In
some instances where $\int_{\mathcal{I}} \hat b x$ is root-$n$ consistent for $\int_{\mathcal{I}} bx$, $m$ can be chosen
within a wide range without appreciably affecting the performance of the
estimator. Only in relatively "unsmooth" cases, where either $\gamma < \frac{1}{2}(\alpha+1)$,
or $\gamma > \frac{1}{2}(\alpha+1)$ but $\gamma$ is close to $\frac{1}{2}(\alpha+1)$, is the choice of $m$ rather critical.
The empirical identification of unsmooth cases, and empirical choice of $m$
in those instances, are challenging problems, and we shall not attempt to
address them here. (See the last paragraph of Section 2 for discussion of $\alpha$,
$\beta$ and $\gamma$.)

Instead, we shall give below a simple threshold-based algorithm for choosing
$m$ empirically in cases where $x$ is sufficiently smooth. There, the algorithm
guarantees root-$n$ consistency. The order of magnitude of the empirically
chosen $m$ depends very much on selection of the threshold, but
nevertheless the estimator $\int_{\mathcal{I}} \hat b x$ remains root-$n$ consistent in a very wide
range of cases. Therefore, the effectiveness of the threshold algorithm
underscores the robustness of the estimator against choice of $m$ in cases where
$x$ is smooth.

To describe the threshold algorithm, let $C > 0$ and $0 < c \le \frac{1}{2}$, and put
$\hat I_j = 1$ if $\hat\theta_j \ge t \equiv Cn^{-c}$, with $\hat I_j = 0$ otherwise. Since the sequence $\hat\theta_1, \hat\theta_2, \ldots$
is nonincreasing and $\hat\theta_j = 0$ for $j \ge n+1$, then $\hat I_1, \hat I_2, \ldots$ is a sequence of
$\hat m$, say, 1's, followed by an infinite sequence of 0's. Therefore the threshold
algorithm implicitly gives an empirical rule for choosing the cut-off, $\hat m$. Our
estimator of $\int_{\mathcal{I}} bx$ is $\int_{\mathcal{I}} \hat b x$, where $\hat b = \sum_{1 \le j \le \hat m} \hat b_j \hat\phi_j$. Note that the estimator
can be written as

$\int_{\mathcal{I}} \hat b x = \sum_{j} \hat I_j \hat b_j \hat x_j = \sum_{j=1}^{\hat m} \hat b_j \hat x_j$,

where $\hat x_j = \int_{\mathcal{I}} x\hat\phi_j$. This form is often easier to use in numerical calculations.
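
The threshold rule can be sketched in a few lines of Python. The helper names `threshold_cutoff` and `thresholded_prediction` are hypothetical, and the default values of `C` and `c` are placeholders rather than recommendations.

```python
import numpy as np

def threshold_cutoff(theta_hat, n, C=1.0, c=0.5):
    """Return (m_hat, t): the number of empirical eigenvalues with
    theta_hat_j >= t, where t = C * n**(-c)."""
    t = C * n ** (-c)
    theta_hat = np.sort(np.asarray(theta_hat, dtype=float))[::-1]  # nonincreasing order
    m_hat = int(np.sum(theta_hat >= t))
    return m_hat, t

def thresholded_prediction(b_hat_j, x_hat_j, m_hat):
    """Estimate of int b x as the sum over j <= m_hat of b_hat_j * x_hat_j."""
    return float(np.dot(b_hat_j[:m_hat], x_hat_j[:m_hat]))
```

Because the empirical eigenvalues are nonincreasing, counting the eigenvalues above the threshold is the same as locating the last index $j$ with $\hat I_j = 1$.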

To appreciate the size of $\hat m$ chosen by this rule, let us suppose that $\theta_j =
\mathrm{const.}\, j^{-\alpha}$. It can be shown that, for the specified range of values of $c$, $\hat\theta_j =
\mathrm{const.}\, j^{-\alpha}\{1 + o_p(1)\}$ uniformly in $1 \le j \le \hat m + k$, for each integer $k \ge 1$.
Therefore, $\hat m = \mathrm{const.}\, n^{c/\alpha}\{1 + o_p(1)\}$. It follows that the order of magnitude
of $\hat m$ changes a great deal as we vary $c$.
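
A small numerical illustration of this point, using the population eigenvalues $\theta_j = j^{-\alpha}$ in place of their estimates (an idealization made for the sketch): the number of eigenvalues above $Cn^{-c}$ is the integer part of $(n^{c}/C)^{1/\alpha}$, which is of order $n^{c/\alpha}$. The function name and the cap `jmax` are assumptions of the example.

```python
import numpy as np

def cutoff_for_power_law(n, alpha, C=1.0, c=0.5, jmax=10**5):
    """Count indices j <= jmax with j**(-alpha) >= C * n**(-c)."""
    j = np.arange(1, jmax + 1, dtype=float)
    theta = j ** (-alpha)                 # idealized eigenvalues theta_j = j^(-alpha)
    return int(np.sum(theta >= C * n ** (-c)))
```

For instance, with $n = 1000$ and $\alpha = 2$ the cut-off is 5 when $c = 1/2$ but only 2 when $c = 1/4$.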

It can be proved too that, under the conditions of Theorem 4.1, and
assuming that $\alpha > 2$, $\gamma > \frac{1}{2}(\alpha+2)$ and $\beta + \gamma > (\alpha/2c) + 1$,

(3.1) $\sum_{j=1}^{\hat m} \hat b_j \hat x_j = \int_{\mathcal{I}} bx + O_p(n^{-1/2})$.

This result demonstrates the root-$n$ consistency of the estimator on the left-hand
side, for a range of different orders of magnitude of $\hat m$. Of course, (3.1)
continues to hold if the number of terms, $\hat m$, is replaced by a deterministic
quantity, say $m \sim \mathrm{const.}\, n^{c/\alpha}$. Note too that the conditions $\gamma > \frac{1}{2}(\alpha+2)$
and $\beta + \gamma > (\alpha/2c) + 1$ are both implied by $\gamma > \max(3/2, 1/2c)\,\alpha + 3$, which
asserts simply that the function $x$ is sufficiently smooth relative to $K$.

The case where the functions $X_i$ are observed on a regular grid of $k$ points
with additive white noise may be treated similarly. Indeed, it can be proved
that if continuous approximations to the $X_i$'s are generated by passing a
local-linear smoother through noisy, gridded data, and if we take $c = \frac{1}{2}$,
then all the results discussed above remain true provided $n = O(k)$. That
is, $k$ should be of the same order as, or of larger order than, $n$. Details are
given in the Appendix of [6]. Similar results are obtained using smoothing
methods based on splines or orthogonal series.

A simulation study was carried out to investigate the finite-sample performance
of the thresholding procedure given above. The study considered
the model (1.1) in two cases. In the first, the predictor $X_i$ was observed
continuously without error. Specifically, random samples of size $n = 100$
were generated from the model (1.1), where $\mathcal{I} = [0,1]$, the random functions
$X_i$ were distributed as $X = \sum_j Z_j 2^{1/2}\cos(j\pi t)$, the $Z_j$'s were independent
and normal N$(0, 4j^{-2})$, $b = \sum_j j^{-4} 2^{1/2}\cos(j\pi t)$, and the errors $\varepsilon_i$ were
independent and normal N$(0,4)$. The future observation of $X$ was taken to
be $x = \sum_j j^{-2} 2^{1/2}\cos(j\pi t)$, in which case the conditional mean of $y$ given
$X = x$ was 1.0141.
Table 1

Comparison of average squared errors

Threshold                0.001   0.01    0.05    0.1     0.15    0.2
X continuous             0.026   0.019   0.015   0.014   0.013   0.015
X discrete with noise    0.035   0.022   0.016   0.017   0.015   0.016


The example in the second case was the same as that for the first, except 
that each $X_i$ was observed discretely on an equally-spaced grid of 200 points
with additive N$(0,1)$ random noise. We used an orthogonal-series smoother
to "estimate" each $X_i$ from the corresponding discrete data. Table 1 gives
values of averaged squared error of the estimator of the conditional mean, 
computed by averaging 500 Monte Carlo simulations. It is clear from these 
results that the procedure is robust against discretization, random errors 
and choice of the threshold. 
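
A simulation of this general kind can be sketched as follows. This is not the authors' code and is not intended to reproduce Table 1: the truncation of the series at `J` terms, the midpoint grid, the zero intercept and the fixed threshold are all assumptions made for the illustration.

```python
import numpy as np

def simulate_once(rng, n=100, J=50, p=200, thresh=0.1, sigma=2.0):
    """One replication: squared error of the thresholded predictor of int b x."""
    t = (np.arange(p) + 0.5) / p                          # midpoint grid on [0, 1]
    dt = 1.0 / p
    j = np.arange(1, J + 1, dtype=float)
    basis = np.sqrt(2) * np.cos(np.outer(j, np.pi * t))   # (J, p) cosine basis
    Z = rng.normal(scale=2.0 / j, size=(n, J))            # Z_j ~ N(0, 4 j^-2)
    X = Z @ basis
    b = (j ** -4.0) @ basis                               # slope function
    x = (j ** -2.0) @ basis                               # future observation
    Y = X @ b * dt + rng.normal(scale=sigma, size=n)      # intercept a = 0 (assumed)
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / n * dt)
    keep = evals >= thresh                                # threshold rule for the cut-off
    theta_hat = evals[keep]
    phi_hat = evecs[:, keep].T / np.sqrt(dt)
    g_hat = (Y - Y.mean()) @ Xc / n
    b_hat_j = (phi_hat @ g_hat * dt) / theta_hat
    x_hat_j = phi_hat @ (x - X.mean(axis=0)) * dt
    pred = Y.mean() + b_hat_j @ x_hat_j                   # a_hat + int b_hat x
    truth = np.sum(b * x) * dt                            # a + int b x with a = 0
    return (pred - truth) ** 2
```

Averaging `simulate_once` over many replications, for several values of `thresh`, gives a rough analogue of the first row of the table under these assumptions.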

Earlier in this section we discussed the robustness of $\hat b$ to choice of smoothing
parameter in the prediction problem. This robustness is not shared in
cases where $\hat b$ is of interest in its own right, rather than a tool for prediction.
To make this comparison explicit, and to compare the levels of smoothing
appropriate for prediction and estimation, we extended the simulation
study above. We selected $X$ as before, but took $b = 10\sum_j j^{-2} 2^{1/2}\cos(j\pi t)$
and $x = \sum_j j^{-1.6} 2^{1/2}\cos(j\pi t)$. In the case of noisy, discrete observations we
took the noise to be N$(0,1)$ and the grid to consist of 500 points. Sample
size was $n = 100$.

For the thresholds $t = 0.001, 0.01, 0.05, 0.1, 0.15, 0.2$ used to construct Table 1,
mean squared prediction error was relatively constant; respective
values were 0.013, 0.008, 0.007, 0.010, 0.015, 0.022. However, mean integrated
squared error of $\hat b$ was as high as 168 when $t = 0.001$, dropping to 6.67 at
$t = 0.01$ and reaching its minimum, 0.639, at $t = 0.1$. Similar results were
achieved in the case of noisy, discrete data; values of mean squared prediction
error there were 0.014, 0.008, 0.009, 0.013, 0.019, 0.028 for the respective
values of $t$, and mean integrated squared error of $\hat b$ was elevated by about
30% across the range, the minimum again occurring when $t = 0.1$.

These results also indicate the advantages of undersmoothing when mak- 
ing predictions, as opposed to estimating b in its own right. In particular, 
the numerical value of the optimal threshold for prediction is a little less 
than that for estimating b. Discussion of theoretical aspects of this point 
will be given in Section 4. 

4. Convergence rates. 

4.1. Effect of the intercept, $a$. In terms of convergence rates, the problems
of estimating $a + \int_{\mathcal{I}} bx$ and $\int_{\mathcal{I}} bx$ are not intrinsically different. To
appreciate this point, define $\mu = E(X)$, let the functionals $p$ and $\hat p$ be as in
Section 2, and put $q(x) = \int_{\mathcal{I}} b(x - \mu)$ and $\hat q(x) = \int_{\mathcal{I}} \hat b(x - \mu)$. Given a random
variable $Z$, write $M(Z) = (EZ^2)^{1/2}$. Then

$|M\{\hat p(x) - p(x)\} - M\{\hat q(x) - q(x)\}|$

(4.1) $\le M\{\int_{\mathcal{I}} (b - \hat b)(\bar X - \mu) + \bar\varepsilon\}$

$\le (E\|\hat b - b\|^2)^{1/2} (E\|\bar X - \mu\|^2)^{1/2} + (E\bar\varepsilon^2)^{1/2}$.

Provided only that $E\|\hat b - b\|^2$ is bounded, the right-hand side of (4.1) equals
$O(n^{-1/2})$. Hence, (4.1) shows that, up to terms that converge to zero at the
parametric rate $n^{-1/2}$, the rates of convergence of $\hat p(x)$ to $p(x)$ and of $\hat q(x)$
to $q(x)$ are identical. This result, and the fact that $q(x)$ is identical to $\int_{\mathcal{I}} bx$
provided $x$ is replaced by $x - \mu$, imply that when addressing convergence
rates in the prediction problem it is sufficient to treat estimation of $\int_{\mathcal{I}} bx$.

4.2. Estimation of $\int_{\mathcal{I}} bx$. Recall that our estimator of $\int_{\mathcal{I}} bx$ is $\int_{\mathcal{I}} \hat b x$. Suppose
the eigenvalues $\theta_j$ in the spectral decomposition (2.1) satisfy

(4.2) $C^{-1} j^{-\alpha} \le \theta_j \le C j^{-\alpha}$, $\theta_j - \theta_{j+1} \ge C^{-1} j^{-\alpha-1}$ for $j \ge 1$.

For example, if $\theta_j = Dj^{-\alpha}$ for a constant $D > 0$, then $\theta_j - \theta_{j+1} \sim D\alpha j^{-\alpha-1}$,
and so (4.2) holds. The second part of (4.2) asks that the spacings among
eigenvalues not be too small. Methods based on a frequency cut-off $m$ can
have difficulty when spacings equal zero, or are close to zero. To appreciate
why, note that if $\theta_{j+1} = \cdots = \theta_{j+k}$ then $\phi_{j+1}, \ldots, \phi_{j+k}$ are not individually
identifiable (although the set of these $k$ functions is identifiable). In particular,
individual functions cannot be estimated consistently. This can cause
problems when estimating $\int_{\mathcal{I}} bx$ if the frequency cut-off lies strictly between
$j$ and $j + k$.
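
The spacing claim in the example is easy to check numerically. The sketch below (function name hypothetical) returns the ratio of the exact spacing $\theta_j - \theta_{j+1}$ to its approximation $D\alpha j^{-\alpha-1}$ for $\theta_j = Dj^{-\alpha}$.

```python
import numpy as np

def spacing_ratio(j, alpha, D=1.0):
    """Ratio (theta_j - theta_{j+1}) / (D * alpha * j**(-alpha - 1))
    for the polynomially decaying eigenvalues theta_j = D * j**(-alpha)."""
    j = np.asarray(j, dtype=float)
    spacing = D * j ** (-alpha) - D * (j + 1) ** (-alpha)
    return spacing / (D * alpha * j ** (-alpha - 1))
```

The ratio stays strictly between 0 and 1 and tends to 1 as $j$ increases, which is consistent with a lower bound of the form required in the second part of (4.2).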

Let $Z$ have the distribution of a generic $X_i - E(X_i)$. Then we may
write $Z = \sum_{j \ge 1} \xi_j \phi_j$, where $\xi_j = \int_{\mathcal{I}} Z\phi_j$ is the $j$th principal component, or
Karhunen-Loève coefficient, of $Z$. We assume that all the moments of $X$ are
finite, and more specifically that

(4.3) for each $r \ge 2$ and each $j \ge 1$, $E|\xi_j|^{2r} \le C(r)\theta_j^{r}$, where $C(r)$ does
not depend on $j$; and, for any sequence $j_1, \ldots, j_4$, $E(\xi_{j_1} \cdots \xi_{j_4}) = 0$
unless each index $j_k$ is repeated.

In particular, (4.3) holds if $X$ is a Gaussian process. Let $\beta > 1$ and $C_1 > 0$,
and let

(4.4) $\mathcal{B} = \mathcal{B}(C_1,\beta) = \{ b : b = \sum_{j=1}^{\infty} b_j\phi_j, \text{ with } |b_j| \le C_1 j^{-\beta} \text{ for each } j \ge 1 \}$.
We can interpret $\mathcal{B}(C_1,\beta)$ as a "smoothness class" of functions, where the
functions become smoother (measured in the sense of generalized Fourier
expansions in the basis $\phi_1, \phi_2, \ldots$) as $\beta$ increases. We suppose too that the
fixed function $x$ satisfies

(4.5) $x = \sum_{j=1}^{\infty} x_j\phi_j$ with $|x_j| \le C_2 j^{-\gamma}$ for each $j$.

Again, $x$ becomes smoother in the sense of generalized Fourier expansions
as $\gamma$ increases.

Define $m_0 = m_0(n)$ by

(4.6) $m_0 = \begin{cases} n^{1/\{2(\beta+\gamma-1)\}}, & \text{if } \alpha+1 < 2\gamma, \\ (n/\log n)^{1/(\alpha+2\beta-1)}, & \text{if } \alpha+1 = 2\gamma, \\ n^{1/(\alpha+2\beta-1)}, & \text{if } \alpha+1 > 2\gamma. \end{cases}$

These explicit values serve to simplify our discussion and our proof of Theorem 4.1,
and do not reflect the wider range of values of $m$, particularly in
the case $\alpha+1 < 2\gamma$, for which our theory is valid. Discussion of this point
has been given in Section 3.
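
For readers who want to see how the cut-off in (4.6) scales, a short sketch follows. The function name is hypothetical, the first branch uses the exponent $1/\{2(\beta+\gamma-1)\}$ (the value at which the truncation bias $m^{-(\beta+\gamma-1)}$ equals $n^{-1/2}$), and the exact floating-point equality test for the boundary case is a simplification.

```python
import math

def m0(n, alpha, beta, gamma):
    """Reference cut-off m_0(n) in the three smoothness regimes of (4.6)."""
    if alpha + 1 < 2 * gamma:                       # smooth x: parametric regime
        return n ** (1.0 / (2.0 * (beta + gamma - 1.0)))
    if alpha + 1 == 2 * gamma:                      # boundary case (exact equality assumed)
        return (n / math.log(n)) ** (1.0 / (alpha + 2.0 * beta - 1.0))
    return n ** (1.0 / (alpha + 2.0 * beta - 1.0))  # unsmooth x: nonparametric regime
```

In all three branches the returned value is at most $n^{1/(\alpha+2\beta-1)}$.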

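Recall the definition of $\hat b$ at (2.2). Given arbitrary positive constants $C_3$,
$C_4$ and $C_5$, let

(4.7) $\tilde b = \begin{cases} \hat b, & \text{if } \|\hat b\| \le C_4 n^{C_5}, \\ C_3, & \text{otherwise,} \end{cases}$

where, for a function $\psi$ on $\mathcal{I}$, $\|\psi\|^2 = \int_{\mathcal{I}} \psi^2$. This truncation of $\hat b$ serves to
ensure that all moments of $\tilde b$ are finite.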



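Theorem 4.1. Assume the eigenvalues $\theta_j$ satisfy (4.2), that (4.3) holds
and that all moments of the distribution of the errors $\varepsilon_i$ are finite. Let $\alpha$,
$\beta$ and $\gamma$ be as in (4.2), (4.4) and (4.5), respectively. Suppose that $\alpha > 1$,
$\beta \ge \alpha + 2$ and $\gamma > \frac{1}{2}$, and that the ratio of $m$ to $m_0$ is bounded away from zero
and infinity as $n \to \infty$. Then, for each given $C, C_1, \ldots, C_5 > 0$, as $n \to \infty$,
the estimator $\tilde b$ given in (4.7) satisfies

(4.8) $\sup_{b \in \mathcal{B}(C_1,\beta)} E\left( \int_{\mathcal{I}} \tilde b x - \int_{\mathcal{I}} bx \right)^2 = O(\tau)$,

where $\tau = \tau(n)$ is given by

(4.9) $\tau = \begin{cases} n^{-1}, & \text{if } \alpha+1 < 2\gamma, \\ n^{-1}\log n, & \text{if } \alpha+1 = 2\gamma, \\ n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}, & \text{if } \alpha+1 > 2\gamma. \end{cases}$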
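
The three regimes of (4.9) can be tabulated with a short helper. The function name is hypothetical, and the boundary case is detected by exact equality, which is adequate only for illustration.

```python
import math

def tau(n, alpha, beta, gamma):
    """Mean-square rate tau(n) of (4.9) in the three smoothness regimes."""
    if alpha + 1 < 2 * gamma:
        return 1.0 / n                              # parametric rate
    if alpha + 1 == 2 * gamma:
        return math.log(n) / n                      # parametric up to a log factor
    # polynomial rate strictly slower than 1/n
    return n ** (-2.0 * (beta + gamma - 1.0) / (alpha + 2.0 * beta - 1.0))
```

At $\gamma = \frac{1}{2}(\alpha+1)$ the exponent in the third branch equals $-1$, so the three branches join up apart from the logarithmic factor.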
The smoothing-parameter choices suggested by (4.6) are different from
those that would be used if our aim were to estimate $b$ rather than $\int_{\mathcal{I}} bx$.
In particular, to optimize the $L_2$ convergence rate of $\hat b$ to $b$ we would take
$m$ to be of size $n^{1/(\alpha+2\beta)}$ in each of the three settings addressed by (4.6).
See, for example, [20]. In the critical cases where $\alpha+1 \ge 2\gamma$, this provides an
order of magnitude more smoothing than is suggested by (4.6). The intuition
behind this result is that the integration step, in the definition $\int_{\mathcal{I}} \hat b x$, provides
additional smoothing no matter what level is used when constructing $\hat b$, and
so less smoothing is needed for $\hat b$.

The case $\alpha+1 < 2\gamma$ is more difficult to discuss in these terms, since
a variety of different orders of magnitude of $m$ can lead to the same optimal
mean-square convergence rate of $n^{-1}$. Further discussion of this issue is given
in Section 3.

Of course, there are other related problems where similar phenomena are 
observed. Consider, for example, the problem of estimating a distribution 
function by integrating a kernel density estimator. In order to achieve the 
same parametric convergence rate as the empirical distribution function, we 
should, when constructing the density estimator, use a substantially smaller 
bandwidth than would be appropriate if we wanted a good estimator of the 
density itself. The operation of integrating the density estimator provides 
additional smoothing, over and above that accorded by the bandwidth, and 
so if the net result is not to be an oversmoothed distribution-function esti- 
mator then we should smooth less at the density estimation step. The same 
is true in the problem of prediction in functional regression; the operation of 
integrating bx provides additional smoothing, and so to get the right amount 
of smoothing in the end we should undersmooth when computing the slope- 
function estimator. A curious feature of the regression prediction problem is 
that, unlike the distribution estimation one, it is not always parametric, and 
in some cases the optimal convergence rate lies strictly between that for the 
nonparametric problem of slope estimation and the parametric $n^{-1/2}$ rate.

4.3. Lower bounds. We adopt notation from Sections 4.1 and 4.2, and
in particular take $x = \sum_{j \ge 1} x_j\phi_j$ to be a function and define $\mathcal{B}$ as at (4.4).
Recall that the functions $\phi_j$ form an orthonormal basis for square-integrable
functions on $\mathcal{I}$. Assume that, for a constant $C_6 > 1$,

$C_6^{-1} \le j^{\alpha}\theta_j \le C_6$ and $C_6^{-1} \le j^{\gamma}|x_j| \le C_6$ for all $j \ge 1$.

Let $\hat T$ denote any estimator of $T(b) = \int_{\mathcal{I}} bx$, and define $\tau = \tau(n)$ as at (4.9).

Our main result in this section provides a lower bound to the convergence
rate of $\hat T$ to $T(b)$, complementing the upper bound given by Theorem 4.1
in the case $\hat T = \int_{\mathcal{I}} \tilde b x$, where $\tilde b$ is given by (4.7). We make relatively specific
assumptions about the nature of the model, for example that $X$ is a Gaussian
process and the intercept, $a$, vanishes, bearing in mind that in the case of a
lower bound, the strength of the result is increased, from some viewpoints,
through imposing relatively narrow conditions.

Theorem 4.2. Let $\alpha$, $\beta$ and $\gamma$ be as in (4.2), (4.4) and (4.5), respectively,
and assume $\alpha, \beta > 1$ and $\gamma > \frac{1}{2}$. Suppose too that the process $X$ is
Gaussian and that the errors $\varepsilon_i$ in the model (1.1) are Normal with zero
mean and strictly positive variance; and take $a = 0$. Then there exists a constant
$C_7 > 0$ such that, for any estimator $\hat T$ and for all sufficiently large $n$,

$\sup_{b \in \mathcal{B}(C_1,\beta)} E\{\hat T - T(b)\}^2 \ge C_7 \tau$,

where $\tau = \tau(n)$ is given as in (4.9).



A comparison of the lower bound given above with the upper bound given
in Theorem 4.1 yields the result that the minimax risk of estimating $\int_{\mathcal{I}} bx$
satisfies

$\inf_{\hat T} \sup_{b \in \mathcal{B}(C_1,\beta)} E\left( \hat T - \int_{\mathcal{I}} bx \right)^2 \asymp \begin{cases} n^{-1}, & \text{if } \alpha+1 < 2\gamma, \\ n^{-1}\log n, & \text{if } \alpha+1 = 2\gamma, \\ n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}, & \text{if } \alpha+1 > 2\gamma, \end{cases}$

where, for positive sequences $a_n$ and $b_n$, $a_n \asymp b_n$ means that $a_n/b_n$ is bounded
away from zero and infinity as $n \to \infty$.



5. Proof of Theorem 4.1. 



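5.1. Preliminaries. Define $\hat\Delta = \hat K - K$, $|||\hat\Delta|||^2 = \int\!\!\int_{\mathcal{I}^2} \hat\Delta^2$ and $\delta_j =
\min_{k \le j}(\theta_k - \theta_{k+1})$. It may be shown from results of Bhatia, Davis and
McIntosh [2] that

(5.1) $\sup_{j \ge 1} |\hat\theta_j - \theta_j| \le |||\hat\Delta|||$, $\sup_{j \ge 1} \delta_j \|\hat\phi_j - \phi_j\| \le 8^{1/2} |||\hat\Delta|||$.

For simplicity in our proof we shall take $m = m_0$, as defined in (4.6). Note
that in this setting $m \le n^{1/(\alpha+2\beta-1)}$ in each of the three cases in (4.6).

Expand $x$ with respect to both the orthonormal series $\phi_1, \phi_2, \ldots$ and
$\hat\phi_1, \hat\phi_2, \ldots$, obtaining $x = \sum_{j \ge 1} x_j\phi_j = \sum_{j \ge 1} \hat x_j\hat\phi_j$, where $x_j = \int_{\mathcal{I}} x\phi_j$ and
$\hat x_j = \int_{\mathcal{I}} x\hat\phi_j$. Put $\hat g_j = \int_{\mathcal{I}} \hat g\hat\phi_j$. In this notation

$\int_{\mathcal{I}} (\hat b - b)x = \sum_{j=1}^{m} (\hat b_j \hat x_j - b_j x_j) - \sum_{j=m+1}^{\infty} b_j x_j$,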
whence it follows that

(5.2) $\left| \int_{\mathcal{I}} (\hat b - b)x \right| \le \left| \sum_{j=1}^{m} (\hat b_j - b_j)x_j \right| + \left| \sum_{j=m+1}^{\infty} b_j x_j \right| + \left| \sum_{j=1}^{m} b_j(\hat x_j - x_j) \right| + \sum_{j=1}^{m} |\hat b_j - b_j|\,|\hat x_j - x_j|$.

It is straightforward to show that $|\sum_{j=m+1}^{\infty} b_j x_j| = O(m^{-(\beta+\gamma-1)})$. This
quantity equals $O\{(n^{-1}\log n)^{1/2}\}$ if $\alpha+1 = 2\gamma$, equals $O(n^{-(\beta+\gamma-1)/(\alpha+2\beta-1)})$
if $\alpha+1 > 2\gamma$ and equals $O(n^{-1/2})$ otherwise. We shall complete the derivation
of Theorem 4.1 by obtaining bounds for second moments of the other three
terms on the right-hand side of (5.2). Our analysis will show that the first
and second terms determine the convergence rate, and that the third and
fourth terms are asymptotically negligible. In the arguments leading to the
bounds we shall use the notation "const." to denote a constant, the value of
which does not depend on $b \in \mathcal{B}$. In particular, the bounds we shall give are
valid uniformly in $b$, although we shall not mention that property explicitly.
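
The algebra behind (5.2) is the elementary identity $\hat b_j\hat x_j - b_jx_j = (\hat b_j - b_j)x_j + b_j(\hat x_j - x_j) + (\hat b_j - b_j)(\hat x_j - x_j)$ followed by the triangle inequality. The sketch below checks both numerically on arbitrary coefficient vectors; the function name, and the truncation of the infinite series to finitely many terms, are assumptions of the example.

```python
import numpy as np

def decompose(b_hat_j, x_hat_j, b_j, x_j):
    """Return (lhs, rhs, bound) for the decomposition behind (5.2).

    b_hat_j, x_hat_j: length-m arrays of estimated coefficients.
    b_j, x_j: longer arrays of true coefficients (a truncated infinite series).
    """
    m = len(b_hat_j)
    lhs = b_hat_j @ x_hat_j - b_j @ x_j          # int (b_hat - b) x
    db = b_hat_j - b_j[:m]
    dx = x_hat_j - x_j[:m]
    t1 = db @ x_j[:m]                            # sum_{j<=m} (b_hat_j - b_j) x_j
    t2 = b_j[m:] @ x_j[m:]                       # sum_{j>m} b_j x_j
    t3 = b_j[:m] @ dx                            # sum_{j<=m} b_j (x_hat_j - x_j)
    cross = db @ dx                              # signed cross term
    rhs = t1 - t2 + t3 + cross
    bound = abs(t1) + abs(t2) + abs(t3) + np.abs(db) @ np.abs(dx)
    return lhs, rhs, bound
```

The first two returned values agree up to rounding error, and the absolute value of either is at most the third.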

5.2. Bound for $|\sum_{j \le m} (\hat b_j - b_j)x_j|$. Note that



(5.3) bj - bj 



(5-4) gj - gj = g 3 - gj + / (g - g)(4>j ~4>j)+ / g 



Therefore, defining A„ = g — g, we have 



(5.5) 

If the event 
(5.6) 



9 j ~ 9 j 



<3||A C 



S = {\9j -9j\< \9j for all 1 < j < m} 



holds, then [9J 1 - 9j l \ < 2\9j - Oj\/Q? < 9J 1 . It can be proved, using this 
result, (5.1), (5.4) and (5.5), that if £ holds, 



3=1 



< 



(5.7) 



Y,^j-9j)x j 9 j 1 



+ IIIAIE 

3=1 



+ 



3=1 



'3 v j 
+ 8 1 / 2 |||A|||^(||A 9 ||57i + | 5j |07i)|^|0-i.
i=i 

For each real number $r$, define

$t_r(m) = \begin{cases} m^{r+1}, & \text{if } r > -1, \\ \log m, & \text{if } r = -1, \\ 1, & \text{if } r < -1. \end{cases}$

Standard moment calculations, noting that Si(g) = J2j< m (9j ~ 9j) x j^J 1 
may be expressed as a sum of n independent and identically distributed ran- 
dom variables with zero mean, show that E{S\(g) 2 } < const. n~ l t a ^2^{fn)i 
uniformly in g. Moreover, denoting by 82(g) the last term on the right-hand 
side of (5.7), we deduce that 

IIAIII^dlA^I^ + l^ie- 1 )^-^- 1 ! 

(5.8) 'J' 

< const. {n 2 t2 a - 1 +i(rn) 2 + n 1 i a _ / g__ 7 (m) 2 }. 

If (3 > 7 then t Q _ j a_ 7 (m) < t a -.2-y(m), and if (3 < 7 then, since /? > \(a + 1), 
a — P — 7 < —1, implying that t Q „ j a_ 7 (m) < const. i a _2 7 (fn). Moreover, 
i2a-7+i("i) < const .t a _2 7 (m)m a+1 , and by assumption, n > m a+1 . There- 
fore, n~ l t2a-^+i(m) < const. i a _2 7 (m). Hence, (5.8) implies that E{S2{g) 2 } < 
const.n _1 t Q _2 7 (?7T-). Combining this bound with that for E{S\(g) 2 }, and 
with (5.7), and writing I(T) for the indicator function of any subset T C £, 
we deduce that 



E 



(5.9) 



< const. E 



2i 



+ E 



mm 



ij=i 



n 1 t a ^2-i{m) 



Note too that if £ holds, 



(5.10) 



III III s p Z\ 

Y J {b j -b j ) 2 < C o^t.Y^ef\(g j -g 3 ) 2 + gfa-h) 
3=1 i=i 1 JI ' 



+ const.|||A||| 2 {||A g || 2 i 4Q , +2 (m) + t 2a -2f3(m)}, 
E^
m



E(% - x j) 2 = E / ^ - 

Let p = g or x, and define 7r = a + /3 and n = 7 in the respective cases. 
Let gi,g2,- ■• denote constants satisfying \qj\ < const .j K for each j, where 
k = a — 7 if p = g, and k = —(a + /?) if p = x. Given 7/ > 0, consider the 
event 

T= {||| A||| ^n^ 1 / 2 ) and 

(5.13) 

|0j - %| < 5C .T for all 1 < j < m}. 
Comparing (5.6) and (5.13), and noting (4.2), we see that T 'C£. We shall 
show in Section 5.5 that, uniformly in 1 < j < const. n l ^ a+l \ 

(5.14) E\I{F) J^-^)' 2 
and also, 

(5.15) E 



< const. n 1 j a (l+j 



•2a+2-27n 



< const. n 1 t2 K -a{fn)- 



Next we use (5.15) to bound the first term on the right-hand side of (5.9): 



(5.16) E 



< const. n t 



a— 27 



m . 



To bound the second term, it can be proved from (5.14) that 

^ 2t 



(5.17) 



9{<Pj 



< const. n -2{/3-a-(3/2)}/(ar+2/3-l) _ 

Going back to the definition of T at (5.13), and taking 7/ < {f3 — a — (3/2)}/(a + 
2(3 — 1), we deduce from (5.17) that 

^ 2n 



(5.18) E /(^)|||A||| 2 ^ / gWj-to; 
L ij=i 

Results (5.9), (5.16) and (5.18) imply that 

( m. ^ 2 



< const. n 



(5.19) 



WjE^-^ 



< const. n 1 t Q ,_2 7 (m). 
5.3. Bounds for $|\sum_{j \le m} b_j(\hat x_j - x_j)|$ and $\sum_{j \le m} |\hat b_j - b_j|\,|\hat x_j - x_j|$. Noting
that $\kappa = -(\alpha+\beta)$ when $p = x$, we may also use (5.15) and (5.14) to bound
the expected values of the squares of the right-hand sides of (5.11) and
(5.12), respectively, multiplied by $I(\mathcal{F})$:

^ 2 n 



(5.20) E 



(5.21) 



lj=i J 



< const. n 



< const. n 1 t a+ 3_2 7 (m). 



Noting that (3 > a + 2 and E(gj 
(5.10) and (5.14) that 



9j 



(5.22) 



From (5.21) and (5.22) it follows that 



) 2 < const. n we can show from 



< const. n 1 m a+l . 



2-i 



(5.23) 



(m 



< £ I(^) £(6,- - 6,) 2 U J(.F) - x,) 5 



< const. n 1 m a+1 ■ n 1 t Q , + 3_2 7 (m) < const. n 



5.4. Completion of the proof of Theorem 4.1. Combining (5.2), (5.19), 
(5.20) and (5.23) we deduce that 

2l 



(5.24) 



E 



(b-b)x 



< const. n t a _2 7 (m). 



The proof of Theorem 4.1 will be complete if we show that the factor I (J 7 ) 
can be removed from the left-hand side. Since, in view of (4.7), our estimator 
b satisfies ||6|| < C^ 5 , then it suffices to prove that, for all D > 0, P{T) = 
1 — 0(n~ D ). Now the first part of (5.1) and (5.13) imply that if we define 

G = {||| A||| < min^-^cCr 1 ™- - 1 )}, 

then $\mathcal G \subseteq \mathcal F$. Since $m \le n^{1/(\alpha+2\beta-1)}$ and $2(\alpha+1) < \alpha+2\beta-1$, then for some $\eta' > 0$, $m^{-\alpha-1} \ge n^{\eta'-(1/2)}$. Therefore, if $\xi > 0$ is sufficiently small, there exists $n_0 \ge 1$ such that, if we define $\mathcal H = \{|||\Delta||| \le n^{\xi-(1/2)}\}$, then for all $n \ge n_0$, $\mathcal H \subseteq \mathcal G$. Since we assumed all moments of the principal components $\xi_j$ and the errors to be finite, Markov's inequality is readily used to show that $P(\mathcal H) = 1 - O(n^{-D})$ for all $D > 0$. It follows that $P(\mathcal F) = 1 - O(n^{-D})$, and so (5.24) implies (4.8).
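
The exponent $\eta'$ in the argument above can be made explicit. The following worked step uses only the two stated facts, $m \le n^{1/(\alpha+2\beta-1)}$ and $2(\alpha+1) < \alpha+2\beta-1$; the particular choice of $\eta'$ is one admissible value, shown for illustration, and is not claimed to be the authors' choice.

```latex
m^{-\alpha-1} \;\ge\; n^{-(\alpha+1)/(\alpha+2\beta-1)} \;=\; n^{\eta' - (1/2)},
\qquad
\eta' \;=\; \frac12 - \frac{\alpha+1}{\alpha+2\beta-1} \;>\; 0 ,
```

where positivity of $\eta'$ is exactly the condition $2(\alpha+1) < \alpha+2\beta-1$.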
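5.5. Proof of (5.14) and (5.15). Define $\Delta_j$ by

(5.25)  $\hat\phi_j(t) = \phi_j(t) + \sum_{k:k\ne j}(\theta_j-\theta_k)^{-1}\phi_k(t)\int\Delta\phi_j\phi_k + \Delta_j(t)$.

It may be proved that

$\hat\phi_j - \phi_j = \sum_{k:k\ne j}(\hat\theta_j-\theta_k)^{-1}\phi_k\int\Delta\hat\phi_j\phi_k + \phi_j\int_{\mathcal I}(\hat\phi_j-\phi_j)\phi_j$,

from which it follows that

$\Delta_j = \sum_{k:k\ne j}\{(\hat\theta_j-\theta_k)^{-1} - (\theta_j-\theta_k)^{-1}\}\phi_k\int\Delta\hat\phi_j\phi_k + \sum_{k:k\ne j}(\theta_j-\theta_k)^{-1}\phi_k\int\Delta(\hat\phi_j-\phi_j)\phi_k + \phi_j\int_{\mathcal I}(\hat\phi_j-\phi_j)\phi_j$.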
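
The expansion of $\hat\phi_j - \phi_j$ used here has an exact finite-dimensional analogue, which can be checked numerically. The sketch below is an illustration only: it replaces the covariance operator by a symmetric matrix `K`, the perturbation $\Delta$ by a small symmetric matrix, and integrals by dot products. The dimension, the eigenvalues $k^{-2}$, the perturbation size and the function name `expansion_residual` are all assumptions made for the example, not quantities from the paper.

```python
import numpy as np

def expansion_residual(d=6, eps=1e-2, j=0, seed=0):
    """Max abs error in the matrix analogue of the expansion of phi_hat_j - phi_j."""
    rng = np.random.default_rng(seed)
    # Orthonormal "eigenfunctions" phi_k (columns of Q), eigenvalues theta_k = k^{-2}.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    theta = 1.0 / np.arange(1, d + 1) ** 2
    K = (Q * theta) @ Q.T
    # Small symmetric perturbation Delta, so that K_hat = K + Delta.
    E = rng.standard_normal((d, d))
    Delta = eps * (E + E.T) / 2
    theta_hat, Phi_hat = np.linalg.eigh(K + Delta)
    order = np.argsort(theta_hat)[::-1]  # eigh sorts ascending; use descending order
    theta_hat, Phi_hat = theta_hat[order], Phi_hat[:, order]
    phi_j, phi_hat_j = Q[:, j], Phi_hat[:, j]
    # Term phi_j * int (phi_hat_j - phi_j) phi_j
    rhs = phi_j * (phi_j @ (phi_hat_j - phi_j))
    # Terms (theta_hat_j - theta_k)^{-1} phi_k * int Delta phi_hat_j phi_k, k != j
    for k in range(d):
        if k != j:
            w_jk = Q[:, k] @ Delta @ phi_hat_j
            rhs += Q[:, k] * w_jk / (theta_hat[j] - theta[k])
    return float(np.max(np.abs((phi_hat_j - phi_j) - rhs)))

print(expansion_residual(j=0), expansion_residual(j=1))
```

The residual should be at the level of floating-point rounding, because the matrix identity $\phi_k^{\top}\Delta\hat\phi_j = (\hat\theta_j-\theta_k)\,\phi_k^{\top}\hat\phi_j$ holds exactly whenever $\hat\theta_j \ne \theta_k$.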

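If $\mathcal F$ holds then so too does the event $\mathcal E$ and, in view of (4.2), $|\hat\theta_j - \theta_k|^{-1} \le 2|\theta_j - \theta_k|^{-1}$ for all $1 \le j \le m$ and all $k \ne j$. Therefore, writing $p = \sum_{j\ge 1} p_j\phi_j$ and using (5.1), we deduce that

(5.26)  $\big|\int p\,\Delta_j\big| \le 2|\hat\theta_j - \theta_j|\big\{\sum_{k:k\ne j}(\theta_j-\theta_k)^{-4}p_k^2\big\}^{1/2}\|\Delta\hat\phi_j\| + \big\{\sum_{k:k\ne j}(\theta_j-\theta_k)^{-2}p_k^2\big\}^{1/2}\|\Delta(\hat\phi_j-\phi_j)\| + |p_j|\,\big|\int(\hat\phi_j-\phi_j)\phi_j\big|$.

Since $|p_j| \le \mathrm{const.}\; j^{-\pi}$ for each $j$ then, if $d = 2$ or $4$,

$\sum_{k:k\ne j}(\theta_j-\theta_k)^{-d}p_k^2 \le \mathrm{const.}\,\{t_{\alpha d-2\pi}(j) + j^{\alpha d+d-2\pi}\} \le \mathrm{const.}\,(1 + j^{\alpha d+d-2\pi})$.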

Moreover, $\|\Delta\hat\phi_j\| \le \|\Delta\phi_j\| + \|\Delta(\hat\phi_j-\phi_j)\|$, $E\|\Delta\phi_j\|^2 \le \mathrm{const.}\; n^{-1}\theta_j$, and if $\mathcal F$ holds, $\|\Delta(\hat\phi_j-\phi_j)\| \le \mathrm{const.}\; |||\Delta|||^2 j^{\alpha+1}$. We shall show in Section 5.6 that

(5.27)  if $\eta$, in the definition of $\mathcal F$ at (5.13), is chosen sufficiently small, then whenever $\mathcal F$ holds, $|\int(\hat\phi_j-\phi_j)\phi_j| \le C_0 a_j$ for $1 \le j \le m$, where $C_0 > 0$ is a constant depending on neither $j$ nor $n$, and $a_j$ is a nonnegative random variable satisfying $E(a_j^2) \le n^{-2}j^4$.
Combining (5.26) and the results in this paragraph, we deduce that

(5.28)  $E\big\{I(\mathcal F)\big[\int p\,\Delta_j\big]^2\big\} \le \mathrm{const.}\,\{n^{-2}j^{-\alpha}(1 + n^{-1}j^{3\alpha+2})(1 + j^{4\alpha+4-2\pi}) + \cdots\}$.

Note too that

(5.29)  $E\big\{\sum_{k:k\ne j}(\theta_j-\theta_k)^{-1}p_k\int\Delta\phi_j\phi_k\big\}^2 \le \mathrm{const.}\; n^{-1}j^{-\alpha}(1 + j^{2\alpha+2-2\pi})$.

When $p = g$ we may substitute $\pi = \alpha+\beta$ into (5.28). Then we can deduce from (5.28) that, assuming $\alpha + 2 < \beta$ as well as the bound $j \le m \le n^{1/(\alpha+2\beta-1)}$, the right-hand side of (5.28) is bounded above by a constant multiple of $n^{-1}j^{-\alpha}$. Since $\beta > 1$, this bound also applies to the right-hand side of (5.29).

In the case $p = x$ the fact that $\alpha + 2 < \beta$, as well as the bound $j \le m \le n^{1/(\alpha+2\beta-1)}$, imply that the right-hand side of (5.28) is dominated by the right-hand side of (5.29). Hence, for both $p = g$ and $p = x$ the bound at (5.14) follows from (5.25), (5.28) and (5.29).

Observe too that by (5.28)

(5.30)  $E\big[I(\mathcal F)\big\{\sum_{j\le m} q_j\int p\,\Delta_j\big\}^2\big] \le m\;\mathrm{const.}\,\{n^{-2}t_{\alpha+2\kappa+1}(m) + n^{-2}t_{3\alpha+2\kappa+4-2\pi}(m) + n^{-3}t_{2\alpha+2\kappa+2}(m) + n^{-3}t_{6\alpha+2\kappa+6-2\pi}(m)\}$.

Now, $\kappa - \pi = -(\beta+\gamma)$ if $p = g$, and it equals $-(\alpha+\beta+\gamma)$ if $p = x$. Therefore, if $p = g$ then $3\alpha+2\kappa+4-2\pi = 3\alpha+4-2(\beta+\gamma) \le (\alpha+2\beta-1)-1$, and $6\alpha+2\kappa+6-2\pi = 2\{3\alpha+3-(\beta+\gamma)\} \le 2(\alpha+2\beta-1)-1$. [We subtract the extra 1 to account for the factor $m$ on the right-hand side of (5.30).] These two results, and the fact that $m^{\alpha+2\beta-1} \le n$, imply that the terms in $mn^{-2}t_{3\alpha+2\kappa+4-2\pi}(m)$ and $mn^{-3}t_{6\alpha+2\kappa+6-2\pi}(m)$ in (5.30) may be replaced by $n^{-1}$ without affecting the validity of the bound when $p = g$. Furthermore, when $p = g$, $\alpha+2\kappa+1 = 3\alpha-2\gamma+1 \le (\alpha+2\beta-1)-1$ and $2\alpha+2\kappa+2 = 4\alpha-2\gamma+2 \le 2(\alpha+2\beta-1)-1$, and so the terms in $mn^{-2}t_{\alpha+2\kappa+1}(m)$ and $mn^{-3}t_{2\alpha+2\kappa+2}(m)$ may also be replaced by $n^{-1}$. Therefore the right-hand side of (5.30) may be replaced by $n^{-1}$ when $p = g$. An identical argument shows this also to be the case when $p = x$. Hence, in either setting,

(5.31)  $E\big[I(\mathcal F)\big\{\sum_{j\le m} q_j\int p\,\Delta_j\big\}^2\big] \le \mathrm{const.}\; n^{-1}$.
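
The four exponent inequalities used to reduce (5.30) to a bound of order $n^{-1}$ are simple linear inequalities in $(\alpha, \beta, \gamma)$ and can be checked on a grid. The sketch below computes the slack in each one, taking $\kappa = \alpha-\gamma$, $\pi = \alpha+\beta$ for $p = g$ and $\kappa = -(\alpha+\beta)$, $\pi = \gamma$ for $p = x$, as implied by the relations for $\kappa$ and $\kappa-\pi$ quoted in the text. The grid, the restriction to $\alpha > 0$ and $\gamma > 0$, and the function name `exponent_slack` are assumptions of this example; only $\beta > \alpha + 2$ is taken from the proof.

```python
def exponent_slack(alpha, beta, gamma, p="g"):
    """Slack (right side minus left side) in the four exponent inequalities."""
    if p == "g":
        kappa, pi = alpha - gamma, alpha + beta
    else:  # p == "x"
        kappa, pi = -(alpha + beta), gamma
    r = alpha + 2 * beta - 1
    return [
        (r - 1) - (3 * alpha + 2 * kappa + 4 - 2 * pi),
        (2 * r - 1) - (6 * alpha + 2 * kappa + 6 - 2 * pi),
        (r - 1) - (alpha + 2 * kappa + 1),
        (2 * r - 1) - (2 * alpha + 2 * kappa + 2),
    ]

def min_slack_on_grid(p="g"):
    """Smallest slack over a coarse grid with beta > alpha + 2."""
    worst = float("inf")
    for a10 in range(1, 51):          # alpha in (0, 5]
        for g10 in range(1, 51):      # gamma in (0, 5]
            for extra10 in range(1, 31):  # beta - (alpha + 2) in (0, 3]
                alpha, gamma = a10 / 10, g10 / 10
                beta = alpha + 2 + extra10 / 10
                worst = min(worst, min(exponent_slack(alpha, beta, gamma, p)))
    return worst

print(min_slack_on_grid("g"), min_slack_on_grid("x"))
```

A nonnegative minimum slack on the grid is consistent with the inequalities holding throughout the region $\beta > \alpha + 2$; it is a sanity check rather than a proof.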
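Using (4.3) it can be proved that

(5.32)  $E\big\{\sum_{j\le m}\sum_{k:k\ne j}(\theta_j-\theta_k)^{-1}q_jp_k\int\Delta\phi_j\phi_k\big\}^2 \le \mathrm{const.}\; n^{-1}\, t_{2\kappa-\alpha}(m)$.

Combining (5.25), (5.31) and (5.32) we obtain (5.15).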

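5.6. Proof of (5.27). It may be proved from (5.25) that $\|\hat\phi_j - \phi_j\|^2 = u_j^2 + v_j^2$, where

$u_j^2 = \sum_{k:k\ne j}(\hat\theta_j-\theta_k)^{-2}w_{jk}^2$,  $v_j^2 = \big\{\int(\hat\phi_j-\phi_j)\phi_j\big\}^2$

and $w_{jk} = \int\Delta\hat\phi_j\phi_k$. Since both $\hat\phi_j$ and $\phi_j$ are of unit length then $v_j^2 = 2\{1-(1-u_j^2)^{1/2}\} - u_j^2$, which implies that

(5.33)  for all $j \ge 1$,  $\|\hat\phi_j-\phi_j\|^2 \le 2u_j^2$,  $|v_j| \le u_j^2$.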
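
The unit-length identity behind (5.33) can be verified numerically in a finite-dimensional setting. The sketch below reads (5.33) as the pair of bounds $\|\hat\phi_j-\phi_j\|^2 \le 2u_j^2$ and $|v_j| \le u_j^2$, with $u_j^2$ the squared length of the component of $\hat\phi_j$ orthogonal to $\phi_j$ and $v_j$ the component of $\hat\phi_j - \phi_j$ along $\phi_j$. It assumes the usual sign convention that the inner product of $\hat\phi_j$ and $\phi_j$ is nonnegative; the dimension, number of trials and the function name `check_unit_vector_bounds` are hypothetical choices for the example.

```python
import numpy as np

def check_unit_vector_bounds(d=5, trials=1000, seed=0):
    """Check the unit-length identity and the two bounds on random unit vectors."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        # Columns of Q play the role of the orthonormal basis phi_1, phi_2, ...
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        phi_j = Q[:, 0]
        phi_hat = rng.standard_normal(d)
        phi_hat /= np.linalg.norm(phi_hat)
        if phi_hat @ phi_j < 0:      # sign convention (an assumption of this sketch)
            phi_hat = -phi_hat
        coef = Q.T @ phi_hat          # coordinates of phi_hat in the basis
        u2 = float(np.sum(coef[1:] ** 2))
        v = abs(float(coef[0]) - 1.0)
        dist2 = float(np.sum((phi_hat - phi_j) ** 2))
        identity = 2.0 * (1.0 - np.sqrt(max(1.0 - u2, 0.0))) - u2
        if abs(dist2 - (u2 + v ** 2)) > 1e-12:
            return False
        if abs(v ** 2 - identity) > 1e-7:   # loose tolerance: sqrt amplifies rounding
            return False
        if dist2 > 2.0 * u2 + 1e-12 or v > u2 + 1e-12:
            return False
    return True

print(check_unit_vector_bounds())
```

Both bounds follow from $1 - (1-u^2)^{1/2} \le u^2$ for $0 \le u^2 \le 1$, which is what the loop confirms case by case.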

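If the event $\mathcal F$ obtains then $|\hat\theta_j - \theta_k|^{-1} \le 2|\theta_j - \theta_k|^{-1}$ for all $j, k$ such that $j \ne k$ and $1 \le j \le m$. For the same range of values of $j$ and $k$, $|\theta_j - \theta_k|^{-1} \le D\theta_m^{-1}m$. Here $D = C^2$, where $C$ is as in (4.2). Defining $x_{jk} = \int\Delta\phi_j\phi_k$ and $y_{jk} = \int\Delta(\hat\phi_j-\phi_j)\phi_k$, we have $w_{jk}^2 \le 2(x_{jk}^2 + y_{jk}^2)$, and hence, assuming $\mathcal F$ holds, we have for $1 \le j \le m$,

(5.34)  $u_j^2 \le 8\sum_{k:k\ne j}(\theta_j-\theta_k)^{-2}(x_{jk}^2 + y_{jk}^2) \le 8A_j + 8D^2\theta_m^{-2}m^2c_j \le 8A_j + 8D^2\theta_m^{-2}m^2\,|||\Delta|||^2\,\|\hat\phi_j-\phi_j\|^2$,

where $A_j = \sum_{k:k\ne j}(\theta_j-\theta_k)^{-2}x_{jk}^2$ and $c_j = \sum_{k:k\ne j}y_{jk}^2 \le |||\Delta|||^2\,\|\hat\phi_j-\phi_j\|^2$.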

Condition (4.3) implies that $nE(x_{jk}^2) \le \mathrm{const.}\;\theta_j\theta_k$, where the constant
does not depend on j, k or n. Moreover, 

Therefore, $E(A_j) \le \mathrm{const.}\; n^{-1}j^2$ for $1 \le j \le m$, and similar calculations show that

(5.35)  $E(A_j^2) \le D_1 n^{-2}j^4$,

where $D_1 > 0$ depends on neither $j$ nor $n$.

Combining (5.34) with the first part of (5.33) we deduce that if $\mathcal F$ holds,

(5.36)  $\|\hat\phi_j-\phi_j\|^2 \le 16A_j + 16D^2\theta_m^{-2}m^2\,|||\Delta|||^2\,\|\hat\phi_j-\phi_j\|^2$

for $1 \le j \le m$. However, if $c > 0$ is given, and if $\eta > 0$ is chosen sufficiently small in the definition of $\mathcal F$ at (5.13), then for all sufficiently large $m$, $\mathcal F$ implies $|||\Delta||| \le c\,\theta_m m^{-1}$. Hence, by (5.36), if $\mathcal F$ holds, then for $1 \le j \le m$,

$(1 - 16D^2c^2)\,\|\hat\phi_j-\phi_j\|^2 \le 16A_j$.

Choosing $c$ so small that $16D^2c^2 \le \frac12$, we deduce that if $\mathcal F$ holds, then for $1 \le j \le m$, $\|\hat\phi_j-\phi_j\|^2 \le 32A_j$. Combining this result with (5.34), and noting the choice of $c$, we deduce that if $\mathcal F$ holds, then for $1 \le j \le m$, $u_j^2 \le 16A_j$. From this property and the second part of (5.33) we conclude that if $\mathcal F$ holds, then for $1 \le j \le m$,

(5.37)  $\big|\int(\hat\phi_j-\phi_j)\phi_j\big| \le u_j^2 \le \|\hat\phi_j-\phi_j\|^2 \le 32A_j$.

Taking $a_j = D_1^{-1/2}A_j$, where $D_1$ is as at (5.35), and letting $C_0 = 32D_1^{1/2}$, we see that (5.27) follows from (5.35) and (5.37).

6. Proof of Theorem 4.2. We shall treat only the cases $2\gamma < \alpha+1$ and $2\gamma = \alpha+1$, since the third setting, $2\gamma > \alpha+1$, is relatively straightforward. For notational simplicity we shall assume that $C_1$, in the definition of $\mathcal B(C_1,\beta)$, satisfies $C_1 \ge 1$, and take $\theta_j = j^{-\alpha}$ and $x_j = j^{-\gamma}$. More general cases are easily addressed.

Since $X$ is Gaussian then we may write $X_i = \sum_{j\ge 1}\xi_{ij}\phi_j$ for $i \ge 1$, where the variables $\xi_{ij}$ are independent and normal with zero mean and respective variances $\theta_j$ for $j \ge 1$. Define $\nu$ to be the integer part of $n^{1/(\alpha+2\beta-1)}$, and let $B_0 = 0$ and $B_1 = \sum_{\nu+1\le j\le 2\nu} j^{-\beta}\phi_j$; both are functions in $\mathcal B(C_1,\beta)$.

Note that $T(B_0) = 0$ and that for large $n$,

(6.1)  $T(B_1) \ge \mathrm{const.}\; n^{-(\beta+\gamma-1)/(\alpha+2\beta-1)}$,

where, here and below, "const." denotes a finite, strictly positive, generic constant. Write $\Xi_i = \sum_{\nu+1\le j\le 2\nu}\xi_{ij}j^{-\beta}$. The observed data are $Y_i = t\,\Xi_i + \varepsilon_i$ for $1 \le i \le n$, where $t = 0$ or $1$ according as $b = B_0$ or $b = B_1$, respectively. Denote by $P_t$ the joint distribution of the $Y_i$'s for $t = 0$ or $1$. Elementary calculations show that the chi-squared distance between $P_0$ and $P_1$ is given

by 

where $\sigma^2$ denotes the variance of the error distribution.
The variables $\Xi_i$ are independent and normally distributed with zero means and variance $V_n$, where $nV_n = n\sum_{\nu+1\le j\le 2\nu} j^{-\alpha-2\beta} \to \mathrm{const.}$ as $n \to \infty$. Indeed,

(6.2)  $E_1\{d(P_0, P_1)\} \to \mathrm{const.}$,

where $E_t$ denotes expectation in the model with $b = B_t$, for $t = 0$ or $1$. Let $\hat T$ be any estimator such that for some $D > 0$,

(6.3)  $E_0\{\hat T - T(B_0)\}^2 \le D\, n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}$.
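
The convergence of $nV_n$ to a constant, which drives (6.2), can be illustrated numerically. The sketch below takes $\theta_j = j^{-\alpha}$ as in this section and, to avoid rounding in the integer part, parametrises by $\nu$ and sets $n = \nu^{\alpha+2\beta-1}$, so that $n$ is treated as a real number; that simplification is an assumption of the example. The limiting value is obtained by comparing the sum with the integral of $x^{-\alpha-2\beta}$ over $[\nu, 2\nu]$, and the function names and parameter values are hypothetical.

```python
def n_times_Vn(nu, alpha, beta):
    """n * V_n with n = nu**(alpha + 2*beta - 1) and V_n = sum of j**(-alpha - 2*beta)."""
    s = alpha + 2 * beta
    n = float(nu) ** (s - 1)
    V_n = sum(float(j) ** (-s) for j in range(nu + 1, 2 * nu + 1))
    return n * V_n

def limit_constant(alpha, beta):
    """Integral approximation: (1 - 2**(1 - s)) / (s - 1), where s = alpha + 2*beta."""
    s = alpha + 2 * beta
    return (1 - 2 ** (1 - s)) / (s - 1)

for nu in (10, 100, 1000):
    print(nu, n_times_Vn(nu, 1.2, 3.5), limit_constant(1.2, 3.5))
```

Because the summand is decreasing, the sum differs from the integral by at most $\nu^{-\alpha-2\beta}$, so after multiplying by $n$ the gap is of order $1/\nu$.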
Put

$\rho = \dfrac{2\,[E_0\{\hat T - T(B_0)\}^2\, E_1\{d(P_0,P_1)\}]^{1/2}}{|T(B_1) - T(B_0)|}$.

It follows from (6.1), (6.2) and the fact that $T(B_0) = 0$, that if $D$ in (6.3) is chosen sufficiently small, $\rho < \frac12$. In this case,

(6.4)  $E_1\{\hat T - T(B_1)\}^2 \ge \{T(B_1) - T(B_0)\}^2(1-\rho) \ge \mathrm{const.}\; n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}$,

where the first inequality follows from the constrained-risk lower bound of Brown and Low [4], and the second uses (6.1) and the property $T(B_0) = 0$. Consequently, writing $E_b$ for expectation when the slope function is $b \in \mathcal B(C_1,\beta)$, for any estimator $\hat T$

$\sup_{b\in\mathcal B(C_1,\beta)} E_b\{\hat T - T(b)\}^2 \ge \max_{t=0,1} E_t\{\hat T - T(B_t)\}^2 \ge \mathrm{const.}\; n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}$.

The case $2\gamma = \alpha+1$ may be treated similarly, by taking $\nu = (n/\log n)^{1/(\alpha+2\beta-1)}$ and replacing $n$ by $n/\log n$ in (6.1), (6.3) and (6.4).